Here, demonstrate …
To do this, we use five genomes from NCBI that span roughly 300M years of vertebrate evolution.
The best way to run GENESPACE is to open a terminal and enter a conda
environment (or similar) where orthofinder is in the path.
OrthoFinder can be installed in many ways (see README. You also need
to ensure that MCScanX is properly installed. Once you have those
completed, from the terminal, enter: open -na rstudio,
which will open an Rstudio interface (if installed). Otherwise, just
enter an interactive R session by executing R.
Once within R, you can install GENESPACE (if you haven’t done so already).
devtools::install_github("jtlovell/GENESPACE", upgrade = F)
If you have already installed a previous version of GENESPACE you may need to run:
detach("package:GENESPACE", unload = TRUE)
devtools::install_github("jtlovell/GENESPACE", upgrade = F)
Then import GENESPACE into the R environment:
library(GENESPACE)
## GENESPACE v1.1.10: synteny and orthology constrained comparative
## genomics
Set the paths for your run. Change these paths to those valid on your system
genomeRepo <- "~/path/to/store/rawGenomes"
wd <- "~/path/to/genespace/workingDirectory"
path2mcscanx <- "~/path/to/MCScanX/"
Download the data of interest. Here we are going to use human, mouse, platypus, chicken, and sand lizard. To do this programatically, get the URLs to the genome annotations, but leave off the ‘genomic.gff.gz’ string at the end, then make the full paths by appending ‘translated_cds.faa.gz’ and ‘genomic.gff.gz’ to the ends of the paths. Finally make the directories to write to and download the files.
urls <- c(
human ="000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_",
mouse = "000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_",
platypus = "004/115/215/GCF_004115215.2_mOrnAna1.pri.v4/GCF_004115215.2_mOrnAna1.pri.v4_",
chicken = "016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_",
sandLizard = "009/819/535/GCF_009819535.1_rLacAgi1.pri/GCF_009819535.1_rLacAgi1.pri_")
genomes2run <- names(urls)
urls <- file.path("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF", urls)
translatedCDS <- sprintf("%stranslated_cds.faa.gz", urls)
geneGff <- sprintf("%sgenomic.gff.gz", urls)
names(translatedCDS) <- genomes2run
names(geneGff) <- genomes2run
writeDirs <- file.path(genomeRepo, genomes2run)
names(writeDirs) <- genomes2run
for(i in genomes2run){
print(i)
if(!dir.exists(writeDirs[i]))
dir.create(writeDirs[i])
download.file(
url = geneGff[i],
destfile = file.path(writeDirs[i], basename(geneGff[i])))
download.file(
url = translatedCDS[i],
destfile = file.path(writeDirs[i], basename(translatedCDS[i])))
}
GENESPACE requires that fasta headers exactly match the fourth (ID)
column in the bed file. Feel free to do this yourself, or use the
GENESPACE function parse_annotations.
genomes2run <- c("human", "mouse", "platypus", "chicken", "sandLizard")
parsedPaths <- parse_annotations(
rawGenomeRepo = genomeRepo,
genomeDirs = genomes2run,
genomeIDs = genomes2run,
presets = "ncbi",
genespaceWd = wd)
## human: n unique sequences = 20627, n matched to gff = 20627 (dropped 88 ./- chars)
## mouse: n unique sequences = 22852, n matched to gff = 22852 (dropped 254 ./- chars)
## platypus: n unique sequences = 18252, n matched to gff = 18252
## chicken: n unique sequences = 18093, n matched to gff = 18093 (dropped 1 ./- chars)
## sandLizard: n unique sequences = 20364, n matched to gff = 20364
See APPENDIX A: annotation formats for more information on how the annotations should be stored and prepared.
init_genespaceThe function init_genespace does most of the heavy
lifting in terms of checking to make sure that the input data is OK. It
also produces the correct directory structure and corresponding paths
for the GENESPACE run. We have it as a separate function from the main
pipeline run_genespace so that we can catch issues with
input and give the user flexibility to alter input without re-running
the full pipeline.
gpar <- init_genespace(
wd = wd,
path2mcscanx = path2mcscanx)
## Checking Working Directory ... PASS: `~/Desktop/gs_v1_runs/test4riptut/run`
## Checking user-defined parameters ...
## Genome IDs & ploidy ...
## chicken : 1
## human : 1
## mouse : 1
## platypus : 1
## sandLizard: 1
## Outgroup ... NONE
## n. parallel processes ... 5
## collinear block size ... 5
## collinear block search radius ... 25
## n gaps in collinear block ... 5
## synteny buffer size... 100
## only orthogroups hits as anchors ... TRUE
## n secondary hits ... 0
## Checking annotation files (.bed and peptide .fa):
## chicken : 18093 / 18093 geneIDs exactly match (PASS)
## human : 20627 / 20627 geneIDs exactly match (PASS)
## mouse : 22852 / 22852 geneIDs exactly match (PASS)
## platypus : 18252 / 18252 geneIDs exactly match (PASS)
## sandLizard: 20364 / 20364 geneIDs exactly match (PASS)
## Checking dependencies ...
## Found valid path to OrthoFinder v2.54: `orthofinder`
## Found valid path to DIAMOND2 v2.014: `diamond`
## Found valid MCScanX_h executable: `/Users/lovell/Desktop/programs/MCScanX//MCScanX_h`
See APPENDIX B GENESPACE parameters for more information on how the run can be customized. However, the defaults should work for most circumstances.
When called, init_genespace first checks that the
annotations conform as above in the /bed and
/peptide directories of the genespace working directory
wd. It then builds out the proper directory structure for
the GENESPACE run, which are initially empty. The directory is structed
as follows:
/wd
└───bed # see above - where the parsed bed annotations go
└───peptide # see above - where the parsed peptide fastas go
└───orthofinder # where the raw orthofinder results reside
└───results # where the parsed orthofinder results and other output
└───syntenicHits # parsed blast hits and syntenic hits stored here
└───dotplots # all blast and syntenic block dotplots
└───tmp # special directory for temporary or internal files
└───pangenes # intermediate and final pan-gene construction data
└───riparian # riparian plots and their source data incl. phased blocks
Using default settings, depending on your machine, this can take a few minutes up to an hour.
GENESPACE prints a ton of output. We don’t provide a mechanism to suppress this because it is crucial to make sure things are looking ok. There are some descriptions about the text printed to the console (see below output …)
out <- run_genespace(gpar, overwrite = T)
##
## ############################
## 1. Running orthofinder (or parsing existing results)
## Checking for existing orthofinder results ...
## ... found existing run, not re-running orthofinder
## re-ordering genomeIDs by the species tree: human, mouse,
## platypus, chicken, sandLizard
##
## ############################
## 2. Combining and annotating bed files w/ OGs and tandem array info ...
## ##############
## Flagging chrs. w/ < 10 unique orthogroups
## ...chicken : 475 genes on 55 small chrs.
## ...human : 7 genes on 5 small chrs.
## ...mouse : 1 genes on 1 small chrs.
## ...platypus : 197 genes on 50 small chrs.
## ...sandLizard: 2 genes on 2 small chrs.
## ##############
## Flagging over-dispered OGs
## ...chicken : 167 genes in 6 OGs hit > 8 unique places
## ...human : 148 genes in 5 OGs hit > 8 unique places
## ...mouse : 156 genes in 5 OGs hit > 8 unique places
## ...platypus : 345 genes in 6 OGs hit > 8 unique places
## ...sandLizard: 438 genes in 8 OGs hit > 8 unique places
## ##############
## Annotation summaries (after exclusions):
## ...chicken : 17451 genes in 15561 OGs || 2076 genes in 389 arrays
## ...human : 20473 genes in 17730 OGs || 3045 genes in 835 arrays
## ...mouse : 22695 genes in 18048 OGs || 5093 genes in 956 arrays
## ...platypus : 17815 genes in 16296 OGs || 1671 genes in 494 arrays
## ...sandLizard: 19924 genes in 17093 OGs || 2928 genes in 778 arrays
##
## ############################
## 3. Combining and annotating the blast files with orthogroup info ...
## # Chunk 1 / 3 (12:44:39) ...
## ...mouse v. mouse: total hits = 271842, same og = 82158
## ...sandLizard v. sandLizard: total hits = 228821, same og = 52316
## ...mouse v. sandLizard: total hits = 292554, same og = 18892
## ...mouse v. human: total hits = 290818, same og = 28831
## ...mouse v. platypus: total hits = 276135, same og = 21848
## # Chunk 2 / 3 (12:44:42) ...
## ...sandLizard v. human: total hits = 263514, same og = 18331
## ...human v. human: total hits = 220601, same og = 47634
## ...sandLizard v. platypus: total hits = 251588, same og = 17834
## ...human v. platypus: total hits = 240331, same og = 22339
## ...platypus v. platypus: total hits = 188424, same og = 38043
## # Chunk 3 / 3 (12:44:44) ...
## ...sandLizard v. chicken: total hits = 233566, same og = 16908
## ...mouse v. chicken: total hits = 249754, same og = 15350
## ...chicken v. chicken: total hits = 190677, same og = 56494
## ...human v. chicken: total hits = 224848, same og = 15143
## ...platypus v. chicken: total hits = 207314, same og = 14896
## ##############
## Generating dotplots for all hits ... Done!
##
## ############################
## 4. Flagging synteny for each pair of genomes ...
## # Chunk 1 / 3 (12:45:22) ...
## ...mouse v. human: 22261 hits (15782 anchors) in 357 blocks (250 SVs, 233 regions)
## ...human v. platypus: 18500 hits (13668 anchors) in 593 blocks (447 SVs, 395 regions)
## ...mouse v. platypus: 18295 hits (13622 anchors) in 604 blocks (410 SVs, 453 regions)
## ...mouse v. sandLizard: 17586 hits (12649 anchors) in 746 blocks (595 SVs, 467 regions)
## ...sandLizard v. human: 17543 hits (12572 anchors) in 684 blocks (581 SVs, 382 regions)
## # Chunk 2 / 3 (12:45:48) ...
## ...sandLizard v. platypus: 17530 hits (12931 anchors) in 677 blocks (550 SVs, 413 regions)
## ...sandLizard v. chicken: 16770 hits (12542 anchors) in 496 blocks (436 SVs, 254 regions)
## ...mouse v. chicken: 15868 hits (11705 anchors) in 660 blocks (524 SVs, 436 regions)
## ...human v. chicken: 15761 hits (11755 anchors) in 589 blocks (508 SVs, 360 regions)
## ...platypus v. chicken: 15794 hits (11845 anchors) in 593 blocks (491 SVs, 364 regions)
## # Chunk 3 / 3 (12:46:10) ...
## ...mouse v. mouse: 103561 hits (22798 anchors) in 23 blocks (0 SVs, 0 regions)
## ...chicken v. chicken: 56904 hits (18084 anchors) in 96 blocks (0 SVs, 0 regions)
## ...sandLizard v. sandLizard: 60533 hits (20364 anchors) in 23 blocks (0 SVs, 0 regions)
## ...human v. human: 58374 hits (20571 anchors) in 30 blocks (0 SVs, 0 regions)
## ...platypus v. platypus: 40713 hits (18240 anchors) in 82 blocks (0 SVs, 0 regions)
##
## ############################
## 5. Building synteny-constrained orthogroups ...
## Done!
##
## ############################
## 6. Integrating syntenic positions across genomes ...
## ##############
## Generating syntenic dotplots ... Done!
## ##############
## Interpolating syntenic positions of genes ...
## human: (0 / 1 / 2 / >2 syntenic positions)
## chicken : 1823 / 13763 / 178 / 0
## human : 0 / 20627 / 0 / 0
## mouse : 621 / 17880 / 57 / 0
## platypus : 1288 / 15094 / 249 / 7
## sandLizard: 2368 / 15095 / 311 / 0
## mouse: (0 / 1 / 2 / >2 syntenic positions)
## chicken : 1845 / 13568 / 350 / 1
## human : 695 / 17402 / 166 / 0
## mouse : 0 / 22852 / 0 / 0
## platypus : 1455 / 14952 / 222 / 9
## sandLizard: 2336 / 14899 / 531 / 8
## platypus: (0 / 1 / 2 / >2 syntenic positions)
## chicken : 1696 / 13699 / 357 / 12
## human : 1854 / 16060 / 344 / 5
## mouse : 2160 / 16169 / 229 / 0
## platypus : 0 / 18252 / 0 / 0
## sandLizard: 2004 / 15339 / 425 / 6
## chicken: (0 / 1 / 2 / >2 syntenic positions)
## chicken : 0 / 18089 / 0 / 0
## human : 3477 / 14638 / 148 / 0
## mouse : 3674 / 14684 / 182 / 18
## platypus : 2628 / 13909 / 101 / 0
## sandLizard: 2537 / 15104 / 133 / 0
## sandLizard: (0 / 1 / 2 / >2 syntenic positions)
## chicken : 1344 / 14377 / 43 / 0
## human : 2729 / 15204 / 325 / 5
## mouse : 2999 / 15269 / 290 / 0
## platypus : 1823 / 14519 / 296 / 0
## sandLizard: 0 / 20364 / 0 / 0
## Done!
##
## ############################
## 7. Final block coordinate calculation and riparian plotting ...
## ##############
## Calculating syntenic blocks by reference chromosomes ...
## n regions (aggregated by 25 gene radius): 7658
## n blocks (collinear sets of > 5 genes): 11188
## ##############
## Building ref.-phased blks and riparian plots for haploid genomes:
## chicken : 7593 phased blocks
## human : 8203 phased blocks
## mouse : 9229 phased blocks
## platypus : 8991 phased blocks
## sandLizard: 8127 phased blocks
## Done!
##
## ############################
## 8. Constructing syntenic pan-gene sets ...
## human : n pos. = 27157, synOgs = 91720, array mem. = 8632, NS orthos 21192
## mouse : n pos. = 27235, synOgs = 91127, array mem. = 9444, NS orthos 21332
## platypus : n pos. = 27290, synOgs = 93067, array mem. = 7680, NS orthos 21341
## chicken : n pos. = 27185, synOgs = 93991, array mem. = 6510, NS orthos 21431
## sandLizard: n pos. = 27245, synOgs = 93422, array mem. = 7368, NS orthos 21196
##
## ############################
## GENESPACE run complete! All results are stored in
## /Users/lovell/Desktop/gs_v1_runs/test4riptut/run in the following
## subdirectories:
## syntenic block dotplots: /dotplots (...synHits.pdf)
## annotated blast files : /syntenicHits
## annotated/combined bed : /results/combBed.txt
## syntenic block coords. : /results/blkCoords.txt
## syn. blk. by ref genome: /riparian/refPhasedBlkCoords.txt
## pan-genome annotations : /pangenes (...pangenes.txt.gz)
## riparian plots : /riparian
## genespace param. list : /results/gsParams.rda
## ############################
## **NOTE** the genespace parameter object is returned or can be loaded
## into R via
## `load('/Users/lovell/Desktop/gs_v1_runs/test4riptut/run/results/gsParams.rda',
## verbose = TRUE)`. Then you can customize your riparian plots by
## calling `plot_riparian(gsParam = gsParam, ...)`. The source
## data and ggplot2 objects are also stored in the /riparian
## directory and can also be accessed by `load(...)`.
## **NOTE** To query genespace results by position or gene, use
## `query_genespace(...)`. See specifications in ?query_genespace
## for details.
## ############################
See APPENDIX C: GENESPACE synteny methods and APPENDIX D: run_genespace pipeline for more details about what GENESPACE does under the hood. Information in APPENDIX C explaining the verbose output printed to the console too.
We can visualize synteny in two ways:
If you have more than 2 genomes, plot_riparian give you
access to an information-dense synteny map. See the riparian guide
tutorial for more information.
ripd <- plot_riparian(
gsParam = out,
refGenome = "human",
useRegions = FALSE)
For just a pair of genomes, its often useful to visualize global
patterns of synteny through pairwise x-y dotplots. We can do this with
GENESPACE using ggdotplot. This produces three dotplots:
(1) all hits above a score threshold, (2) all hits where the query and
target are in the same orthogroup, (3) syntenic anchor hits colored by
syntenic block ID. The first two are heatmaps to reduce file size. The
brighter the color, the more hits underlie that point. The title
explains the binning and thresholding for the heatmaps.
hits <- read_allBlast(
filepath = file.path(out$paths$syntenicHits,
"mouse_vs_human.allBlast.txt.gz"))
ggdotplot(hits = hits, type = "all", verbose = FALSE)
Currently, GENESPACE has three querying utilities: block coordinates, pan-genes and syntenic hits. In all cases, the input is the GENESPACE parameter object and a bed file with the coordinates of the region of interest.
Here, we’ll do the query on a single bed file-like data.frame object containing three regions of interest:
roi <- data.frame(
genome = c("human", "human", "chicken"),
chr = c("6", "X", "Z"),
start = c(3.8e6, 0, 0),
end = c(4.5e6, Inf, 1e6))
For each line in the regions of interest bed parameters,
we extract all hits that map to it. We can either pull ALL hits or just
those in synteny. Since synteny is so strong in this set, its reasonable
to look only at syntenic hits
qreturn <- query_hits(gsParam = out, bed = roi, synOnly = TRUE)
We offer gghits as a method to make a dotplot from hits
stored in memory as a data.table. We can do this here for one of the
regions of interest: just human and mouse X chr:
xvx <- subset(qreturn[["human, X: 0-Inf"]], genome2 == "mouse")
gghits(xvx, useOrder = F)
And just human v. the chromosome that the human X is syntenic to in chicken:
xvx <- subset(qreturn[["human, X: 0-Inf"]], genome2 == "chicken")
gghits(xvx, useOrder = T)
A set of pan-genes are in the same orthogroup and can be placed into
a synteny position against a reference genome. We can extract these as
above using query_pangenes:
test <- query_pangenes(
gsParam = out, bed = roi)
Here is what a few lines of each region looks like (‘…’ indicate more genes are there, but for similicity, we are just showing the first):
## [1] "human, 6: 3800000-4500000"
## pgID interpChr interpOrd human mouse platypus chicken sandLizard
## 1 8418 6 7154.0 FAM50B Fam50b FAM50A* FAM50A*
## 2 8419 6 7154.5 LOC117049681
## 3 8420 6 7154.5 ZNF219* Zfp219* LOC100079925 ZNF219*
## 4 8421 6 7157.0 HLA-DQB2.... H2-Eb1, .... LOC10008....
## 5 8422 6 7158.0 PRPF4B Prpf4b PRPF4B, .... PRPF4B PRPF4B
## [1] "human, X: 0-Inf"
## pgID interpChr interpOrd human mouse platypus chicken sandLizard
## 1 1 X 2 LOC12490.... Ppp2r3d* PPP2R3B PPP2R3B PPP2R3B
## 2 2 X 3 TCEAL6
## 3 3 X 4 NKRF Nkrf* LOC10007.... NKRF* NKRF*
## 4 4 X 5 CCDC120 Ccdc120* CCDC120* CCDC120*
## 5 5 X 6 CT47B1
## [1] "chicken, Z: 0-1e+06"
## pgID interpChr interpOrd human mouse platypus chicken sandLizard
## 1 34 Z 110.0000 MYO5C* Myo5c* MYO5C* LOC121108491 MYO5C*
## 2 35 Z 111.0000 SKA1 Ska1 SKA1 SKA1, LO.... SKA1
## 3 36 Z 111.3333 MAPK4 Mapk4 MAPK4 MAPK4
## 4 37 Z 111.6667 MRO Mro MRO
## 5 38 Z 112.0000 ME2 Me2 ME2 LOC10705.... ME2
See the riparian plotting guide for more details, but we can add a color column to the bed and only look at these regions, either with the full synteny map as a background:
roibed <- roi[,c("genome", "chr")]
roibed$color <- c("pink", "cyan", "green")
ripd <- plot_riparian(
gsParam = out,
useRegions = FALSE,
highlightBed = roibed)
Or just looking at the focal regions of interest:
ripd <- plot_riparian(
gsParam = out,
useRegions = FALSE,
highlightBed = roibed,
backgroundColor = NULL)
See APPENDIX E: output formats for more detail on the raw data output from GENESPACE.
parse_annotationsThis reads raw ncbi-formatted files stored in separate folders for
each genome genomeRepo. Here, this looks like this:
/genomeRepo
└───human
│ │ GCF_000001405.40_GRCh38.p14_translated_cds.faa.gz
│ │ GCF_000001405.40_GRCh38.p14_genomic.gff.gz
└───mouse
│ │ GCF_000001635.27_GRCm39_translated_cds.faa.gz
│ │ GCF_000001635.27_GRCm39_genomic.gff.gz
└───platypus
│ │ GCF_004115215.2_mOrnAna1.pri.v4_translated_cds.faa.gz
│ │ GCF_004115215.2_mOrnAna1.pri.v4_genomic.gff.gz
└───chicken
│ │ GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_translated_cds.faa.gz
│ │ GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.gff.gz
└───sandLizard
│ │ GCF_009819535.1_rLacAgi1.pri_translated_cds.faa.gz
│ │ GCF_009819535.1_rLacAgi1.pri_genomic.gff.gz
It then takes these files, formats then re-names them into the
GENESPACE working directory wd. The peptide fasta file
stored in
/wd
└───bed
│ │ human.bed
│ │ mouse.bed
│ │ platypus.bed
│ │ chicken.bed
│ │ sandLizard.bed
└───peptide
│ │ human.fa
│ │ mouse.fa
│ │ platypus.fa
│ │ chicken.fa
│ │ sandLizard.fa
The formatted files look like this for the human genome:
The bed file, extracted from the gene gff3 annotation, which has chromosome, start, end and id information for each gene (row. )
19 58345182 58353492 A1BG
10 50799408 50885627 A1CF
12 9067707 9116229 A2M
12 8822620 8887459 A2ML1
1 33306765 33321098 A3GALT2
22 42692120 42721301 A4GALT
3 138123712 138133302 A4GNT
12 53307459 53321610 AAAS
12 125065434 125143316 AACS
3 151814115 151828488 AADAC
The peptide fasta, simplified from the translatedCDS fasta:
>A1BG
MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTCQAHLETPDFQLFKNGVAQEPVHLDSPAI...
>A1CF
MEAVCLGTCPEPEASMSTAIPGLKKGNNALQSIILQTLLEKENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFE...
>A2M
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTVSASLESVRGNRSLFTDL...
>A2ML1
MWAQLLLGMLALSPAIAEELPNYLVTLPARLNFPSVQKVCLDLSPGYSDVKFTVTLETKDKTQKLLEYSGLKKRHLH...
>A3GALT2
MALKEGLRAWKRIFWRQILLTLGLLGLFLYGLPKFRHLEALIPMGVCPSATMSQLRDNFTGALRPWARPEVLTCTPW...
>A4GALT
MSKPPDLLLRLLRGAPRQRVCTLFIIGFKFTFFVSIMIYWHVVGEPKEKGQLYNLPAEIPCPTLTPPTPPSHGPTPG...
>A4GNT
MRKELQLSLSVTLLLVCGFLYQFTLKSSCLFCLPSFKSHQGLEALLSHRRGIVFLETSERMEPPHLVSCSVESAAKI...
>AAAS
MCSLGLFPPPPPRGQVTLYEHNNELVTGSSYESPPPDFRGQWINLPVLQLTKDPLKTPGRLDHGTRTAFIHHREQVW...
>AACS
MSKEERPGREEILECQVMWEPDSKKNTQMDRFRAAVGAACGLALESYDDLYHWSVESYSDFWAEFWKFSGIVFSRVY...
>AADAC
MGRKSLYLLIVGILIAYYIYTPLPDNVEEPWRMMWINAHLKTIQNLATFVELLGLHHFMDSFKVVGSFDEVPPTSDE...
For more details, see ?init_genespace. In general, the
user should not have to adjust any of these, as GENESPACE synteny tends
to conform well to the phylogenetic structure of the run. The parameters
are left here so that users can sufficiently modify the GENESPACE run
for edge cases (described below). Here short descriptions of all
parameters that can be given to init_genespace:
genomeIDs: the genome IDs that match files in
/bed and /peptide. These must begin with a
letter (a-z, A-Z) and should not include special characters. ‘.’ and ’_’
are OK.ploidy: What is the ploidy of genome assemblies. This
is usually half of the actual ploidy, that is an inbred diploid usually
is represented by a haploid genome assembly. This is an important
parameter that determines the number of top-scoring blast hits in
orthogroups to be included in the synteny search.ignoreTheseGenomes and outgroup: arguments
are synonyms. Typically only used if there is a whole-genome duplication
that predates the divergence of the genomes of interest, but the user
wants to include homeologs (paralogs resulting from a WGD) among the
focal species. This lets you specify which of the genomeIDs should be
used in the orthofinder run but not in the synteny search.orthofinderInBlk: This is the most important GENESPACE
parameter, which specifies whether orthgroups should be re-called within
syntenic blocks (TRUE) or if orthogroups should be inferred globally
logical. Typically, if there are no WGDs in the history of the genomes,
this results in very little imporvement in performance. However, it can
significantly improve detection of homeologs if there is a WGD. Thus,
default is FALSE unless any of the genomes have ploidy > 1.useHOGs sets whether phylogenetically hierarchical
orthogroups (HOGs) or raw orthogroups are used. Since original -og
orthogroups are explicitly deprecated under the current OrthoFinder
release, this should be set to TRUE except in very rare cases.nGaps: the MCScanX -m parameter,
specifying the number of gaps in the directed acyclic graph. It is not
recommended to mess with this parameters, but larger numbers may results
in more paralogous blocks. Defaults to 5.blkSize: the MCScanX -s and
dbscan(minpts = ...) parameters specifying the minimum
block size. Increasing this can cause much more stringent blocks.
Defaults to 5.blkRadius: the search radius with which to aggregate
syntenic hits into ‘regions’. This is the best parameter to mess with if
you want more detail (lower values) or more simple (higher values)
synteny maps. Defaults to 25.synBuff: the buffer around syntenic anchor hits
(perfectly collinear hits in blocks) used for initial synteny searches.
Larger values may let you extend to adjacent paralagous regions.onlyOgAnchors: sets whether the query and target genes
in syntenic anchors have to be in the same orthogroup. Defaults to TRUE
and probably should stay that way except in rare cases.onlySameChrs logical - should synteny be only
considered between chromosomes with the same names? Not recommended
except with recently diverged polyploids where you want to avoid
spurious syntenic matching between homeologous chromosomes. Also can use
this if just doing a run among synteny-built genomes, where you know
there are no inter-chromosomal SVs. Can dramatically improve speed among
very large genomes.maxOgPlaces specifies the max number of unique
placements that an orthogroup can have before being excluded from
syntenyarrayJump specifies the maximum distance in gene rank
order between two genes in the same tandem arrayIn some cases it may be impossible to include an outgroup to study
paralogs. In this case, GENESPACE offers a set of parameters that
facilitate searches after masking the primary synteny map:
nSecondHits/nSecondaryHits is a scaling
parameter on ploidy, specifying how many additional hits should be
included. nGapsSecond, blkSizeSecond,
blkRadiusSecond, synBuffSecond,
onlyOgAnchorsSecond follow above, but for secondary hits.
maskBuffer numeric (default = 500), the minimum distance
that a secondary (or homeolog w/in polyploid genome) block can be
created relative to an existing block.
nCores the number of parallel processes to run …
defaults to half of those available on the machine. Most processes are
parallelized in GENESPACE.path2orthofinder character string coercible to a file
path that points to the orthofinder executable. If orthofinder is in the
path, specify with “orthofinder”path2diamond character string coercible to a file path
that points to the diamond executable. If diamond is in the path,
specify with “diamond”. Since diamond is included in the OrthoFinder
install, if OrthoFinder is in the $PATH, then so should diamond.path2mcscanx The path to the MCScanx installation. This
must contain the ‘MCScanX_h’ folder.onewayBlast logical of length 1, specifying whether
one-way blasts should be run via orthofinder -1 .... This
replaces orthofinderMethod = “fast”, but uses
diamond2 --more-sensitive whereas the previous method used
–fast specification. Substantial speed improvements in large runs with
little loss of fidelity. However, use carefully since full testing has
not been completed.diamondUltraSens specifies whether the diamond mode run
within orthofinder should be –more-sensitive (default, FALSE)
–ultra-sensitive. Not likely to be necessary, except among very diverged
genomes … these tend to not have synteny anyways.rawOrthofinderDir file.path to the location of an
existing raw orthofinder run. Defaults to the $wd/orthofinder, but can
be any path point to a valid orthofinder run.wd file.path where the analysis will be rundotplots character string either “always”, “never”, or
“check”. Default (check) only writes a dotplot if there are < 10k
unique chromosome combinations (facets). “always” means that dotplots
are made regardless of facet numbers, which can be very slow in some
instances. “never” is by far the fastest method, but also never produces
dotplots.In this case, OrthoFinder has been run previously, so the output is just parsed (this improves rendering speed by 10x for this tutorial).
Here, HOGs from orthofinder are merged with the bed files. With this
data in hand, GENESPACE build ‘tandem arrays’ which are defined as sets
of genes on the same chromosome and in the same orthogroup separated by
no more than arrayJump interleaved genes. It also flags two
sets of problematic genes:
2 * minBlkSize unique
orthogroups. These are unlikely to have any synteny mapping.(max(ploidy)) * 8 unique
positions, separated by no less than arrayJump orthogroups
along a chromosome. Low-quality annotations can often have lots of these
types of orthgroups that hit repetitive or low-complexity genomic
regions everywhere.Genes that fall into either of these two categories are flagged as ‘noAnchor’ and are never used in the synteny searches (but can still be syntenic if proximate to a syntenic anchor).
The gene positions stored in the bed files and orthogroup information
are added to the blast files and stored in /syntenicHits. The total
number of hits and those where the query and target genes are in the
same orthogroup are reported. In addition to building the
syntenicHits/$rawblast.txt.gz text files, this also
populates the dotplots/$rawHits.pdf
See full synteny methods below. For each pair of genomes, the larger genome is the query and the smaller the target. The total syntenic hits and ‘anchors’ are reported. Syntenic anchors are any set of perfectly collinear blast hits as defined first by MCScanX, then cleaned up by dbscan. Syntenic hits (which are the first number reported) are those proximate to the anchors. See above for definitions of blocks and regions. SVs are defined as the number of block breakpoints - the number of unique chromosome combinations that contain syntenic hits.
Output here is verbose only if orthofinderInBlk = TRUE.
Otherwise, this just reads in all the syntenic orthogroup blast hits and
clusters based on the resulting graph.
For each pair of genomes, we conduct linear interpolation of syntenic positions for each gene within the block coordinates. The output reflects the number of overlapping blocks for each gene in each pairwise combination. This should generally be the ploidy of the genome, although can be lower if there are lots of small syntenic blocks (which may have genes in the gaps without a syntenic placement).
By default, we do not construct riparian plots with a polyploid reference - this creates multiple mappings for each color (chromosome) and can be a very confusing and large output file. Because of this, we calculate syntenic blocks phased only against the haploid reference genomes. If there are no haploid references, riparian plots are not constructed.
The syntenic hits and orthogroups are decomposed into syntenic position interpolated pan-gene sets. These consiste of unique interpolated positions against a reference (“n pos”), syntenic orthogroup genes (“synOgs”), members of arrays with genes in synOgs (“array mem”) and non-syntenic orthologs to the single representative gene for each pan-gene set (“NS orthos”).
By default, GENESPACE runs the full OrthoFinder program internally from R. If an orthofinder run is already available (and uses exactly the same genomeIDs as the GENESPACE run), GENESPACE will detect this and not run OrthoFinder.
For very large runs, we highly recommend running OrthoFinder outside
of R. You can do this by using
init_genespace(..., path2orthofinder = NA). When you use
run_genespace with the resulting gs parameters, GENESPACE
will prepare the input files and print the OrthoFinder command to the
console before killing the run. You can then run that command in your
shell or on a server separately.
Once orthofinder is run, you can point to the output directory via
init_genespace(..., rawOrthofinderDir = "/path/to/run"), or
just put it in the /orthofinder directory in your GENESPACE wd.
GENESPACE will find the necessary results, QC them, and if all looks
good, move them to the /results directory and begin the pipeline.
The bed files are concatenated and a genomeID column is added so that all gene position is in one file: /results/combBed.txt. OrthoFinder ofIDs (unique gene identifiers), OGs (orthogroup.tsv) and HOGs (phylogenetically hierarchical orthogroups, N0.tsv) are parsed and added as new columns to the combBed file. This provides all the raw information needed for all further steps.
Problematic genes and chromosomes are flagged by two rules:
chromosomes must have > blkSize (default = 5) unique
orthogroups. Orthogroups must hit < ploidy * maxOgPlaces
(default maxOgPlaces = 8) unique positions in the genome (separated by
> synBuffer, default = 150). Genes in either the
problematic orthogroups or chromosomes are flagged as
noAnchor, meaning they will still be considered for the
pangenome steps, but cannot be used as syntenic anchors or to infer
sytenic positions across genomes.
With HOGs (or OGs if useHOGs = FALSE) in hand, tandem
arrays are calculated as follows: 1) any OG with multiple members on a
single genome-by-chromosome combination are ‘potential tandem arrays’;
2) potential tandem arrays with a gap between members > synBuffer are
split into their own clusters; 3) clusters with > 1 member are tandem
arrays; 4) genes in the most physically central position in the array
are ‘arrayReps’. Gene order is re-calculated for just the array reps and
steps 1-4 are run iteratively until convergence. Genes that are not
array representatives are also falgged as noAnchor.
The information from the combBed file are added to the blast files to
produce annotated blast files, aka synHits. Of particular importance is
whether both the query and target genes are in the same orthogroup
“sameOg” and are not flagged as noAnchor. Reciprocal
query/target blast hits are aggregated into a single file where the
larger genome is the query and the smaller the target.
The information stored in the combined/annotated bed file is merged
with the blast results to make synHits matrices. The function
synteny annotates each unique pairwise synHits file with
three columns: “isAnchor”, “inBuffer”, “isSyntenic” and “blkID”. If
blkID is NA and inBuffer == FALSE, that blast hit is not syntenic.
Otherwise, the hit is syntenic and may be used for direct syntenic
position interpolation if isAnchor == TRUE.
Synteny is inferred by the following pipeline:
Starting in v1.0.9, GENESPACE produces two ‘synHits’ files. The
first, with the format $QUERY_vs_$TARGET.allBlast.txt.gz
contains all blast hits with a set of columns that contain the
information needed for synteny (see below). The function
finalize_synteny (see below) take only hits with TRUE in
the flag isSyntenic and puts these in a simplified text
file with the format: $QUERY_vs_$TARGET.synHits.txt.gz,
which only have the bitScore column from the blast8 specification and
also drops the “sameInblkOg”, “isSyntenic”, and “inBuffer” columns. The
splitting out syntenic hits dramatically improves read/write speed for
all downstream functions that only see syntenic hits.
allBlast files have the following columns:
columns 1-8: annotation positions of the query genes columns 9-16: annotation positions of the target genes columns 17-26: standard blast8-formatted fields col 27 (sameOg): logical, query and target in same orthogroup? col 28 (noAnchor): logical, should this hit never be an anchor? col 29 (isAnchor): logical, is this hit a syntenic anchor? col 30 (inBuffer): logical, is this hit in the synteny buffer? col 31 (isSyntenic): logical, is this hit in the blkRadius? col 32 (regID): character, large region ID of syntenic hits col 33 (blkID): character, blockID of syntenic hits
synHits files have the following columns:
columns 1-8: annotation positions of the query genes columns 9-16: annotation positions of the target genes column 17: bit score from the blast file col 18 (sameOg): logical, query and target in same orthogroup? col 19 (noAnchor): logical, should this hit never be an anchor? col 20 (isAnchor): logical, is this hit a syntenic anchor? col 21 (regID): character, large region ID of syntenic hits col 22 (blkID): character, blockID of syntenic hits
The function syntenic_pangenes converts orthogroup and
position data into a single matrix. This is stored in the /pangenes
directory. This is the raw data that goes into the pangenes output
(position x genome matrix). The pangenes.txt file contains 11
columns:
The function query_pangenes transforms these data into a
more easily read ‘wide’ format: columns 1:5 collow columns 2-5 above.
The remaining columns aggregate genes within each genome (column). Genes
flagged with ’*’ are non-syntenic orthologs; genes flagged with ‘+’ are
array members, but not the most central (representative) gene in the
array.
Here are the first two lines of the combBed.txt file for the human-chicken.
chr start end id ofID pepLen ord genome arrayID isArrayRep globOG globHOG noAnchor og
1: 1 154464218 154465306 LOC418813 0_10171 98 2913 chicken Arr00001 TRUE OG0000006 N0.HOG0000007 OG0000006 FALSE 17571
2: 1 171849899 171861579 LOC112532821 0_8262 1130 3029 chicken Arr00002 TRUE OG0000008 N0.HOG0000009 OG0000008 FALSE 20208
Column definitions:
This file, stored in /results/syntenicBlock_coordinates.csv and /results/syntenicRegion_coordinates.csv, contain pairwise block coordinates for each pair of genomes (genome1, genome2; 1 or 2 appended to the column name indicates that this data is associated with either genome1 or genome2 respectively). For each genome, there are 15 columns:
For riparian plots, blocks need to be ‘phased’ by the reference genome. These coordinates are stored in /riparian/$GENOME_phasedBlks.csv.